In this project, we will try to find the best zip code(s) in which to open a premium-services men's hair salon in San Antonio, Texas. This is the type of barbershop where customers pay a premium for stellar atmosphere, styling, and service. A hallmark of this type of business is that beer or other alcohol is typically given away as 'part of the service.'
Selecting the best zip code to open this type of business allows a potential client to narrow down their search for a commercial storefront to the most promising places. The criteria that we are looking for in determining the best zip codes are as follows:
We will use our data science powers to identify the best locations for this business by zip code based on these criteria. The advantages of each area will then be clearly expressed so that the best possible final location can be chosen by stakeholders.
Based on the definition of our problem, the factors that will influence our decision are:
Our location data will be organized by zip code because US Census data and other demographic collections are often organized this way. This gives us income and family-size data for the residents of a particular zip code.
The following data sources will be needed to extract/generate the required information:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np
filename = 'Zip Codes San Antonio.csv' #I copied and pasted this data into CSV format from:
# "https://www.zip-codes.com/city/tx-san-antonio.asp#zipcodes"
table = pd.read_csv(filename)
table.head() #all tables displayed as head(10) for cleaner HTML presentation
Let's clean this data.
ZipCodes = table # Names the df
ZipCodes1 = ZipCodes.drop(columns = ['Area Code(s)','County']) #Drop the Area Code and County Columns
indexnames = ZipCodes1[ZipCodes1['Type'] == 'P.O. Box' ].index #ID the P.O box Zip Codes
ZipCodes2 = ZipCodes1.drop(indexnames) #Drop Zip codes that are PO Box only
indexnames = ZipCodes2[ZipCodes2['Population'] == '0'].index
ZipCodes3 = ZipCodes2.drop(indexnames) #Drop Zip Codes where population equals zero
print(ZipCodes3['ZIP Code'].nunique()) #Verify no duplicates
print(ZipCodes3['Type'].nunique()) #Verify all Zip Codes are standard type
ZipCodes3['ZIP Code'] = ZipCodes3['ZIP Code'].str.replace('ZIP Code ','',regex=True) #Strip the 'ZIP Code ' prefix, leaving just the 5-digit code
ZipCodes4 = ZipCodes3.drop(columns = 'Type') #Now that all zip code types are standard, we no longer need this column
ZIPlist = ZipCodes4['ZIP Code'].to_list()
For the GeoJSON, I downloaded the open-source file from GitHub and placed it in my working directory. Thanks enactdev! Source URL: https://github.com/OpenDataDE/State-zip-code-GeoJSON/blob/master/tx_texas_zip_codes_geo.min.json
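If you'd rather not click through GitHub, the file can also be fetched programmatically. A minimal standard-library sketch, assuming the usual raw.githubusercontent.com path for the repo's file:

```python
from urllib.request import urlretrieve

# Raw-file URL for the repo's Texas GeoJSON (assumed path; verify it against the repo page)
GEOJSON_URL = ('https://raw.githubusercontent.com/OpenDataDE/State-zip-code-GeoJSON/'
               'master/tx_texas_zip_codes_geo.min.json')
GEOJSON_FILE = 'tx_texas_zip_codes_geo.min.json'

# Uncomment to download into the working directory instead of saving manually:
# urlretrieve(GEOJSON_URL, GEOJSON_FILE)
```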
Now that we have the relevant zip codes and a GeoJSON of their shapes, we have most of the data we need to create a map of our initial search area. Let's grab the lat/long for San Antonio, TX.
# use the geolocator to get a center latlong for San Antonio
from geopy.geocoders import Nominatim
address = 'San Antonio, TX'
geolocator = Nominatim(user_agent="SATX_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of San Antonio are {}, {}.'.format(latitude, longitude))
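One caveat worth guarding against: Nominatim can return None when it is rate-limited or offline, which would make the attribute lookups above fail. A small defensive wrapper, with hardcoded approximate downtown coordinates as an assumed fallback:

```python
# Approximate coordinates for downtown San Antonio, used only as a fallback
# if the geocoder fails (these values are an assumption, fine for map centering).
FALLBACK_LATLON = (29.4241, -98.4936)

def safe_coords(location, fallback=FALLBACK_LATLON):
    """Return (lat, lon) from a geopy result, or the fallback if the lookup failed."""
    if location is None:
        return fallback
    return (location.latitude, location.longitude)

# Usage: latitude, longitude = safe_coords(geolocator.geocode(address))
```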
For our map, we don't want to include zip codes outside of San Antonio, so let's cut our Texas zip codes JSON down to a San Antonio zip codes JSON. I'm going to use mapshaper.org to help me do this.
The mapshaper command interface requires a filter expression string to write the original JSON to a new JSON file. I'll use Python to build that string: I want the tool to 'filter' out the ZIP codes that are not in my dataframe above, so my command ends up being "filter" followed by the generated string.
output = [] #initialize a list
for ZIP in ZipCodes4['ZIP Code']: #iterate over every remaining ZIP code
    output.append('ZCTA5CE10 == "{ZIP}"'.format(ZIP=ZIP))
string = "||".join(output) #join the clauses with OR; joining this way avoids a trailing '||'
string
Output from the mapshaper website was the file SATX_zip_codes_geo.json, which was placed in the folder along with this notebook.
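As an aside, the same filtering can be done without mapshaper: load the Texas GeoJSON with the json module and keep only the features whose ZCTA5CE10 property appears in our ZIP list. A self-contained sketch on a two-feature toy collection (the property name matches the real file; the toy geometry is omitted):

```python
def filter_geojson(geojson, zip_list, prop='ZCTA5CE10'):
    """Keep only the features whose ZIP property appears in zip_list."""
    wanted = set(zip_list)
    kept = [f for f in geojson['features'] if f['properties'].get(prop) in wanted]
    return {'type': 'FeatureCollection', 'features': kept}

# Toy stand-in for tx_texas_zip_codes_geo.min.json
toy = {'type': 'FeatureCollection',
       'features': [
           {'type': 'Feature', 'properties': {'ZCTA5CE10': '78201'}, 'geometry': None},
           {'type': 'Feature', 'properties': {'ZCTA5CE10': '79901'}, 'geometry': None}]}

sa_only = filter_geojson(toy, ['78201', '78202'])  # keeps only the 78201 feature
```

For the real data, `json.load` the Texas file, filter with ZIPlist, and `json.dump` the result out as the San Antonio file.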
import folium
SAmap = folium.Map(location=[latitude, longitude], zoom_start=10)
folium.GeoJson(
'SATX_zip_codes_geo.json',
name='geojson1'
).add_to(SAmap)
folium.LayerControl().add_to(SAmap)
SAmap
So now we have 2 of our 4 data requirements completed: the list of San Antonio zip codes, their boundaries, and the ability to map them. Now let's use the Zip-Codes.com API to pull demographics for each zip code.
#Set up a test API request
key = 'VUX9CRM7QCX7Y5FKR7SS'
ZIPc = '78201'
URL = 'https://api.zip-codes.com/ZipCodesAPI.svc/1.0/GetZipCodeDetails/{ZIPcode}?key={KEY}'.format(ZIPcode = ZIPc, KEY=key) #the angle brackets in the docs' '?key=<...>' are placeholder markers, not part of the URL
URL
import requests
from pandas import json_normalize
result = requests.get(URL).json()
sample = json_normalize(result)
sample.columns.to_list()
Our sample result returns 103 columns, so let's systematically identify which columns we need to slice out of the results for later.
filtered_columns = ['item.ZipCode',
'item.Latitude', #basic location data
'item.Longitude', #basic location data
'item.AreaLand', #used later to compute population density,
'item.ZipCodePopulation', #used later to compute population density,
'item.AverageHouseValue', #Economic Indicator
'item.IncomePerHousehold', #Economic Indicator
'item.Bus03Establishments', #Economic Indicator
'item.Bus03Employment', #Economic Indicator
'item.Bus03PayrollAnnual', #Economic Indicator
'item.MedianAge', #Social Indicator
'item.MedianAgeMale', #Social Indicator
'item.MedianAgeFemale', #Social Indicator
'item.AverageFamilySize', #Social Indicator
]
column_names = ['ZIPCode',
'Latitude',
'Longitude',
'Land_Area',
'Population',
'AverageHouseValue',
'IncomePerHousehold',
'Number_of_Businesses',
'Business_Employment',
'LocalBusinessPayroll',
'MedianAge',
'MedianAgeMale',
'MedianAgeFemale',
'AverageFamilySize',
]
#Defining a function to get the demographic data by ZIP code
def getzipcodesdemodata1(zipcodes, filtered_columns=filtered_columns):
    key = 'VUX9CRM7QCX7Y5FKR7SS'
    frames = [] #collect one single-row dataframe per ZIP code
    for zipcode in zipcodes:
        print(' .', end='')
        # create the API request URL
        URL = 'https://api.zip-codes.com/ZipCodesAPI.svc/1.0/GetZipCodeDetails/{ZIP}?key={KEY}'.format(
            ZIP=zipcode, KEY=key)
        # make the GET request and normalize the JSON into a dataframe
        results = json_normalize(requests.get(URL).json())
        # keep only the relevant columns for this zipcode
        frames.append(results[filtered_columns])
    #pd.concat replaces the deprecated DataFrame.append pattern
    demo_data_df = pd.concat(frames, ignore_index=True)
    demo_data_df.columns = column_names #relabel the columns
    return demo_data_df
ZIPdata = getzipcodesdemodata1(zipcodes = ZIPlist) # pulling the data
ZIPdata.head(10) #all tables displayed as head(10) for cleaner HTML presentation
We now have our demographic data so we can do some initial analysis.
newZIP = ZIPdata[column_names].astype('float')
It appears that some of our zip codes have zero houses in them. This probably corresponds to a central business district where all the land is commercial rather than residential. I'll leave these zip codes in the analysis because business people getting premium haircuts are a potential source of customers, just as a residential population is.
We have very wide variance in land area that we should account for. If we don't account for the different sizes of zip codes, we will accidentally end up clustering on zip code size, since size logically affects the population in a zip code.
Looking at the values we have, there are a few transformations we can do to provide better data to the model and to the customer.
Once this is complete, it will be time to cluster the neighborhoods.
#Adds Population Density and AVG_Paycheck columns to the df
#insert raises ValueError if the column already exists, so wrap it for clean re-runs
try:
    newZIP.insert(5,'Population Density', newZIP.Population / newZIP.Land_Area, allow_duplicates=False)
except ValueError:
    pass
try:
    newZIP.insert(11,'AVG_Paycheck', newZIP.LocalBusinessPayroll / newZIP.Business_Employment, allow_duplicates=False)
except ValueError:
    pass
newZIP.head(10) #all tables displayed as head(10) for cleaner HTML presentation
We have one last step: we'll drop the columns we don't need for K-means clustering.
ZIP_demo_kmeans = newZIP.drop(['ZIPCode', #label, not a demographic feature
                               'Latitude', #location, not a demographic feature
                               'Longitude', #location, not a demographic feature
                               'Land_Area', #better captured in the Population Density column
                               'Population', #better captured in the Population Density column
                               'Number_of_Businesses', #better captured in the AVG_Paycheck column
                               'Business_Employment', #better captured in the AVG_Paycheck column
                               'LocalBusinessPayroll', #better captured in the AVG_Paycheck column
                               'MedianAge', #our target customers are men, so keep MedianAgeMale only
                               'MedianAgeFemale', #our target customers are men, so keep MedianAgeMale only
                               ], axis=1)
ZIP_demo_kmeans.head(3) # Df for Kmeans clustering
Let's normalize the data. We are using K-means clustering to cluster the neighborhoods mathematically, and then we'll interpret the clusters to decide which ones are best for premium barbershops.
from sklearn.preprocessing import StandardScaler
data = ZIP_demo_kmeans
scaler = StandardScaler()
scaler.fit(data)
transformed_ZIP_demo = scaler.transform(data)
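For reference, StandardScaler computes z = (x − mean) / std for each column. A numpy-only toy check of that arithmetic, on made-up data:

```python
import numpy as np

# Two columns on wildly different scales, like our demographic features
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0]])

# What StandardScaler does under the hood, column by column
scaled = (toy - toy.mean(axis=0)) / toy.std(axis=0)
# After scaling, every column has mean 0 and standard deviation 1,
# so no single feature dominates the k-means distance computations.
```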
# import k-means from clustering stage
from sklearn.cluster import KMeans
# set number of clusters
# I'm setting the number at 6 because broadly,
#you can group our data into high and low values for
# 1)residential wealth indicators, 2)business wealth indicators, 3)and age/family indicators
kclusters = 6
# for iterating over different values of kclusters; drop stale labels on re-runs
try:
    ZIP_demo_kmeans.drop(labels='Cluster_Labels',axis=1,inplace=True)
except KeyError:
    pass
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=5).fit(transformed_ZIP_demo)
# add the clusterlabels to the original dataframe. The drop clause is added for
# iterating though different k-values
ZIP_demo_kmeans.insert(1,'Cluster_Labels',kmeans.labels_ ,allow_duplicates=False) #inserting for initial analysis and interpretation
newZIP.insert(1,'Cluster_Labels',kmeans.labels_ ,allow_duplicates=False) #inserting into this table for future plotting
try:
    ZIP_demo_kmeans.insert(0,'ZIPCode',newZIP['ZIPCode'].astype(int),allow_duplicates=False)
except ValueError: #insert raises ValueError if the column already exists; ignore on re-runs
    pass
ZIP_demo_kmeans.head() #all tables displayed as head(10) for cleaner HTML presentation
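Setting kclusters = 6 above is a judgment call, and a standard sanity check is the elbow method: plot within-cluster inertia against k and look for where the improvement flattens. A self-contained sketch on synthetic data (swap in transformed_ZIP_demo for X to test our actual matrix):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 6))  # synthetic stand-in, shaped like our ~61 ZIPs x 6 features

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=5).fit(X)
    inertias.append(km.inertia_)

# Inertia shrinks as k grows; the "elbow" where improvement flattens out
# suggests a reasonable cluster count.
# plt.plot(range(1, 10), inertias) would visualize it.
```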
Clustered_count = ZIP_demo_kmeans.groupby(['Cluster_Labels']).count()
Clustered = ZIP_demo_kmeans.groupby(['Cluster_Labels']).mean().round(2).reset_index() #Groups by cluster labels and creates a new df
Clustered_count # Shows cluster sizes
Clustered
The above shows the mean values for each feature of the clusters created in our K-means. Let's see if we can use a visualization to help us understand this data.
#Because we are graphing, and we have very large quantities along with small quantities in our df, let's transform the data for visualization
data = Clustered
scaler = StandardScaler()
scaler.fit(data)
transformed_Clustered = scaler.transform(data) #returns a np.array
transformed_Clustered1 =pd.DataFrame(data=transformed_Clustered, columns=['Cluster_Labels',
'ZIPCode',
'Population Density',
'AverageHouseValue',
'IncomePerHousehold',
'AVG_Paycheck',
'MedianAgeMale',
'AverageFamilySize']) #Converts output to a df again
transformed_Clustered1['Cluster_Labels'] = Clustered['Cluster_Labels'] #restore the original cluster labels in place of their normalized values
transformed_Clustered1.drop(columns = 'ZIPCode',inplace=True,errors='ignore')
transformed_Clustered1
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20,10))
sns.set(style="whitegrid", palette="muted",font_scale=2)
# "Melt" the dataset to "long-form" or "tidy" representation
melted_clustered = transformed_Clustered1.melt(id_vars='Cluster_Labels', var_name='Feature', value_name='Value')
# Draw a categorical point plot to show each cluster's profile across the features
sns.catplot(x="Feature", y="Value", hue="Cluster_Labels",
            data=melted_clustered, kind="point", height=10, aspect=2);
So now we have a visual way of interpreting our clusters. We seek to answer: in which clusters of zip codes should we locate our business, and which clusters should we avoid?
Now we can rank-order our clusters from best to worst as 3, 1, 4, 2, 0, 5. I want to visualize this as a choropleth map, with the shading marking the most desirable zip codes. To do this, we'll reassign the cluster labels to match our worst-to-best ranking, because a choropleth is designed to show a continuous variable rather than discrete categories.
newlabels = {5:1,
0:2,
2:3,
4:4,
1:5,
3:6} #New Cluster Labels, where Higher labels indicate better zones to locate the business
#Remap only the Cluster_Labels column for plotting; mapping this single column avoids
#accidentally replacing matching data values elsewhere in the dataframe
updated_ZIP_demo_kmeans = ZIP_demo_kmeans.copy()
updated_ZIP_demo_kmeans['Cluster_Labels'] = updated_ZIP_demo_kmeans['Cluster_Labels'].map(newlabels)
#because ZIP code in the JSON file is a string and they need to match for the Choropleth mapping
updated_ZIP_demo_kmeans['ZIPCode']= ZIP_demo_kmeans['ZIPCode'].astype(str)
SAmap1 = folium.Map(location=[latitude, longitude], zoom_start=9)
#folium.Choropleth replaces the older Map.choropleth method, which has been deprecated
folium.Choropleth(
    geo_data='SATX_zip_codes_geo.json',
    data=updated_ZIP_demo_kmeans,
    columns=['ZIPCode','Cluster_Labels'],
    key_on='feature.properties.ZCTA5CE10',
    fill_color='OrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Cluster Labels',
    name='Demographic Shading',
    highlight=True
).add_to(SAmap1)
folium.LayerControl().add_to(SAmap1)
SAmap1
Now that we have a map that shows favorable areas, we can do two things simultaneously:
Let's get crackin' on the Foursquare data. I'm going to pull the name and location of all the salons and barbershops in Foursquare for the zip codes of San Antonio. From this list, I'll categorize which barbershops are economical and which are boutique or premium. From there, I can plot those places on my folium map, using different colored markers to show the category of barbershop.
#Defining a function to get barbershops by ZIP code
def getBarberShopsByZip(zipcodes):
    CLIENT_ID = '1QNOLEFLTMLRHL30MQRLDFKW4GDEG3XLHBSTYJKE0CXS1I4H' # your Foursquare ID
    CLIENT_SECRET = 'AJCPBRHC3FC0EJCQJFFIJFT2A0NXOKUFNAXCZ0DHDVN0MPHX' # your Foursquare Secret
    VERSION = '20180605' # Foursquare API version
    #Pulled from https://developer.foursquare.com/docs/build-with-foursquare/categories/
    BarberCategoryID = '4bf58dd8d48988d110951735'
    intent = 'match'
    LIMIT = 50
    venues_list = []
    for zipcode in zipcodes:
        print(' .', end='')
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&near={}&categoryId={}&intent={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            zipcode,
            BarberCategoryID,
            intent,
            LIMIT)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # I used the below code for saving an initial JSON to figure out paths to the data I wanted
        # import json
        # with open('foursquaretest.json', 'w', encoding='utf-8') as f:
        #     json.dump(results, f, ensure_ascii=False, indent=4)
        # keep only relevant information for each nearby venue
        venues_list.append([(zipcode,
                             v['venue']['name'],
                             v['venue']['location']['lat'],
                             v['venue']['location']['lng']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['ZIP_Searched',
                             'Venue_Name',
                             'Latitude',
                             'Longitude']
    return nearby_venues
SATXbarbers = getBarberShopsByZip(zipcodes= ZIPlist)
SATXbarbers.head(10) #all tables displayed as head(10) for cleaner HTML presentation
print(SATXbarbers.shape)
SATXbarbers.sort_values('Venue_Name',axis=0, inplace=True)
namelist = SATXbarbers['Venue_Name'].values.tolist()
So, from the above, we can tell that there are 2215 barbershops and salons in SATX. We need to process this list to make more sense of it. First, it looks like our Foursquare API pulled duplicates as it searched by zip code, so we'll need to eliminate those.
Also, we're interested in places where men can get their hair cut, so let's eliminate:
Once our list contains ordinary and premium barbershops, we'll look to classify them as either economy or premium barbershops.
#items in our dataset to delete contain the following words. I applied some judgement here to eliminate results that represented venues that
# do not appeal to men including places that start with women's names, foreign words, and frilly adjectives.
# The objective is to have a list of hair salons that men would go to.
sublist = ['Aveda', 'Diva', 'Kids','kids','kidz','Kidz', 'Cheeky', 'School',
'beauty','Beauty','Tanning','tanning','Makeup', 'Guity', 'Ignition',
'makeup','Nails','nails','Nails','brows','Brows','nail','Nail',
'lash','Ana','2379 ne loop 410','glam','Glam','Extensions','extensions','4475 walzem',
'Spa','spa','Lounge','Blow Dry','Allure','Alta','Eyebrow','eyebrow','Body','body',
'Wax','wax','blow','Blown','Bliss', 'Young', 'Volum','Lash','lash','Anna','Regis',
'Avalon', 'Clinic', 'Anila','Alice','Threading','Bella','Blondes','Capri','Cleo',
'Hairbraiding','Delia','Braiding','Esmereld','De','Forever','Foxy','Skin','Beauties','Artistry',
'le','Le', 'Lucia','Lynda','Martha','Melody','Cosmetology','Miriam','Monica','Moxie','Design',
'Posh','Purple','Raquel','Roxana','Sajoir','Avaja','Bellezza','Snow','Sola','Sophia','Stardust',
'Talina','Tangled','Brow','Weave','Trevi','Twirl','UCAS','Versi','Elegance','Voglia','Vogue','Car','Wigs','SPA',
'faith','rosie','petra','mommys','Jacquelines','Color','Imago','Esmarelda','Kayla','Kouture','Medusa','Milan','Miss',
'Neema', 'Boutique','Rene','Cadiz','Hideaway','Identity','Meraki','Platinum','Revelation','Syzygy','Hair Pieces','Neo-graft',
'TR3S', 'Tanfastic','Tease', 'Colour','New You', "Tiffany's", "Tracy's",'Otomi','de','by JC','by jc','By JC','Neograft'
]
results = [] #create a list to append to
for keyword in sublist: #iterate through the keywords in sublist
    res = list(filter(lambda x: keyword in x, namelist)) #find venue names that contain this keyword
    results.append(res) #appends a list to our results list, creating a list of lists
#flatten the list of lists into a single list
#(the loop variable is named inner_list so it doesn't shadow the sublist keyword list)
flat_results = []
for inner_list in results:
    for item in inner_list:
        flat_results.append(item)
#Create a series for future iterating out of the flat list
flat_results = pd.Series(flat_results)
#drop the filtered-out venues from the dataframe
for item in flat_results:
    SATXbarbers.drop(SATXbarbers[SATXbarbers.Venue_Name == item].index, inplace=True)
SATXbarbers.drop('ZIP_Searched',axis=1,inplace=True,errors='ignore') #without inplace=True this drop was a no-op
SATXbarbers.drop_duplicates(subset=['Latitude','Longitude'],inplace=True) #Drop the duplicates
print(SATXbarbers.shape)
SATXbarbers.reset_index(drop=True, inplace=True)
SATXbarbers.drop('index',axis=1, inplace=True, errors='ignore')
SATXbarbers.head(10) #all tables displayed as head(10) for cleaner HTML presentation
Initially, we have 2414 responses, but we've narrowed down our list to 417.
At this point, we've eliminated duplicates and dropped hair salons that cater to women based on clues in their names. Now we'll classify each barbershop as 'Premium', meaning specific competition for us, or 'Economy', meaning general competition. We are doing this so we can map out these places: the premium places should fall on our darkly shaded zip codes, validating the methodology up to this point.
To label a barbershop as premium, we once again look for hints in the name. I then do a Google search to see whether the place is premium, based on unique style, availability of straight-razor shaves, etc. Another indicator of 'premium' is having 2-5 locations, which cues me to check Google for their services.
premium = ['Fine', "Men's", 'Matador', 'Tuneup', 'Boardroom','Champs','Diesel',
'Downtown','Executive',"King's",'Knockouts',"Rob's",'The Art', 'Gents', 'The Barbershop',
'The Good Barber','Tune Up', 'Manly', 'Urban', 'Urbancity', 'Veterans',"Veteran's",'all american',
'nyc']
# This adds a premium column that indicates if a venue is premium or not
SATXbarbers['Premium'] = np.where(SATXbarbers.Venue_Name.str.contains('|'.join(premium)),'Yes','No')
SATXbarbers.head(10) #all tables displayed as head(10) for cleaner HTML presentation
# Split our original dataframe into two dataframes (.copy() avoids SettingWithCopy warnings)
PremiumBarbers = SATXbarbers[SATXbarbers['Premium'] == 'Yes'].copy()
PremiumBarbers.reset_index(drop=True, inplace=True)
PremiumBarbers
EconomyBarbers = SATXbarbers[SATXbarbers['Premium'] == 'No'].copy()
EconomyBarbers.reset_index(drop=True, inplace=True)
EconomyBarbers.head(10) #all tables displayed as head(10) for cleaner HTML presentation
Now that we have a dataframe for our economy barbershops and one for our premium barbershops, let's put markers on our map. The aim of this is as follows:
newlabels = {5:1,
0:2,
2:3,
4:4,
1:5,
3:6} #New Cluster Labels, where Higher labels indicate better zones to locate the business
newZIP['Cluster_Labels'] = newZIP['Cluster_Labels'].map(newlabels) #remap only the label column for plotting; a blanket df.replace could also alter matching data values
newZIP.head(10) #all tables displayed as head(10) for cleaner HTML presentation
import folium
from folium import plugins
import os
SAmap2 = folium.Map(location=[latitude, longitude], zoom_start=10)
#folium.Choropleth replaces the older Map.choropleth method, which has been deprecated
folium.Choropleth(
    geo_data='SATX_zip_codes_geo.json',
    data=updated_ZIP_demo_kmeans,
    columns=['ZIPCode','Cluster_Labels'],
    key_on='feature.properties.ZCTA5CE10',
    fill_color='OrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Cluster Labels',
    name='Demographic Shading',
    highlight=True
).add_to(SAmap2)
# add markers to show ZIP codes
# Create a layer for the ZIP code labels
fg1 = folium.FeatureGroup(name='ZIP Code Labels')
SAmap2.add_child(fg1)
# Create the ZIP code markers/labels and add to the feature group
for lat, lng, postalcode, cluster in zip(newZIP.Latitude.astype('float'),
                                         newZIP.Longitude.astype('float'),
                                         newZIP.ZIPCode.astype('int'),
                                         newZIP.Cluster_Labels):
    label = '{},{}'.format(postalcode, cluster)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(fg1)
# add markers to show Premium Barbershops
# Create a layer for the Premium Barbershops
fg2 = folium.FeatureGroup(name='Premium Barbershops')
SAmap2.add_child(fg2)
# Create the Premium Barbershop markers and add to the feature group
for lat, lng, venue in zip(PremiumBarbers.Latitude.astype('float'), PremiumBarbers.Longitude.astype('float'), PremiumBarbers['Venue_Name']):
    label = folium.Popup('{}'.format(venue), parse_html=True)
    folium.Marker(
        [lat, lng],
        popup=label).add_to(fg2)
# Create a layer for the Economy Barbershops
fg3 = folium.FeatureGroup(name='Economy Barbershops')
SAmap2.add_child(fg3)
# Create the Economy Barbershop markers and add to the feature group
for lat, lng, venue in zip(EconomyBarbers.Latitude.astype('float'), EconomyBarbers.Longitude.astype('float'), EconomyBarbers['Venue_Name']):
    label = folium.Popup('{}'.format(venue), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(fg3)
folium.LayerControl().add_to(SAmap2)
#Save our created map to an HTML file in the working directory
fn = 'SATX_Barbershops.html'
SAmap2.save(fn)
SAmap2
Awesome!!! Now we have a map for human visual analysis. This map has layers for de-cluttering, so along with the zoom feature we should be able to deduce quite a lot from this product.
General Observations and efficacy of the model: Cluster 5 houses by far the highest number of premium barbershops, which suggests that, subjectively, the k-means clustering worked. Furthermore, a simple majority of the premium barbershops are adjacent to Cluster 6 zip codes. This shows that applying basic demographic and economic data at scale, then working backwards to interpret the clusters, produced a map model that provides useful insight into the business problem. This largely validates our methodology.
Observation 1: There are very few premium barbershops in our darkest clusters (the shading meant to highlight suitability for a premium barbershop). However, many premium barbershops sit on the edge of these dark clusters, which hints that these areas may still be associated with customers. It may also be that not all the dark clusters are well classified and that there is high variability within the cluster. Lastly, commercial rents in these areas may be prohibitively high even for premium barbershops. We'll look more closely into Cluster 6 to attempt to explain this observation.
Observation 2: The most promising zip code for starting a new premium barbershop appears to be 78222, which is along the San Antonio outer loop. It's a Cluster 6 area adjacent to a Cluster 4 and a Cluster 5 zip code. The second most promising zip code would be 78250, a Cluster 5 zip code adjacent to 5 other Cluster 5 zip codes.
Areas of Further Analysis:
Cluster6 = updated_ZIP_demo_kmeans[updated_ZIP_demo_kmeans['Cluster_Labels'] == 6] #Cluster 6 zip data
Cluster6AVG = Cluster6.groupby('Cluster_Labels').mean().reset_index() #create a row of average cluster data for comparison
Cluster6med = Cluster6.groupby('Cluster_Labels').median().reset_index() #create a row of median cluster data for comparison
# Add the 3 dataframes together
Cluster6df = pd.concat([Cluster6,Cluster6AVG]).reset_index()
Cluster6df.iloc[8,1] = 'AVG CL6' #label the appended average row in the ZIPCode column
Cluster6df2 = pd.concat([Cluster6df,Cluster6med]).reset_index()
Cluster6df2.iloc[9,2] = 'MED CL6' #label the appended median row in the ZIPCode column
Cluster6df2.drop(['level_0','index','Cluster_Labels'],axis=1,errors='ignore',inplace=True)
Cluster6df2.drop(['MedianAgeMale','AverageFamilySize','AVG_Paycheck'],axis=1,errors='ignore',inplace=True)
plt.figure(figsize=(20,10))
sns.set(style="whitegrid", palette="muted",font_scale=2)
# "Melt" the dataset to "long-form" or "tidy" representation
melted_clustered = Cluster6df2.melt(id_vars='ZIPCode', var_name='Feature', value_name='Value')
# Draw a categorical point plot to show each ZIP code's profile across the features
sns.catplot(x="Feature", y="Value", hue="ZIPCode",
            data=melted_clustered, kind="point", height=10, aspect=2);
Looking at the above graph, we can observe that 78219 (Eastside by Army Heliport), 78264 (Southern Exurb), and 78204 (Downtown) are outliers within Cluster 6.
The presence of these outliers suggests that a feature not present in our model is affecting the formation of premium barbershops in Cluster 6 zip codes, and that the 3 zip codes described above are not in fact good places to locate a premium barbershop.
Additionally, the orange line in the graph above cues us to look more closely at zip code 78218. Looking at the map, there are 3 premium barbershops in adjacent zip codes, but none in the actual zip code itself.
78218 might be a great zip code to locate a new premium barbershop.
Zip code 78222 is the purple line in the above graph. This location looks mediocre now that we've seen its stats more closely.
The last target area of interest is 78250, located in the western part of the city. The zip codes adjacent to 78250 are 78254, 78251, 78253, and 78240.
updated_ZIP_demo_kmeans['ZIPCode'] = updated_ZIP_demo_kmeans['ZIPCode'].astype(str) #astype returns a copy, so the result must be assigned back
west_zips = ['78250','78254','78251','78253','78240'] #78250 plus its adjacent zip codes
#pd.concat replaces the deprecated DataFrame.append pattern
SAwest = pd.concat([updated_ZIP_demo_kmeans[updated_ZIP_demo_kmeans['ZIPCode'] == z] for z in west_zips])
SAwest
Cluster5 = updated_ZIP_demo_kmeans[updated_ZIP_demo_kmeans['Cluster_Labels'] == 5]
Cluster5avg = Cluster5.groupby('Cluster_Labels').mean().reset_index()
Cluster5avg
try: #delete any stale SAwest1 from a prior run
    del SAwest1
except NameError:
    pass
SAwest1 = pd.concat([SAwest,Cluster5avg]).reset_index()
SAwest1.drop(['index','Cluster_Labels'],axis=1,errors='ignore',inplace=True)
SAwest1.iloc[5,0] = 'Cluster 5 Average'
SAwest1
SAwest1['adjusted_Paycheck'] = SAwest1['AVG_Paycheck'].map(lambda x: x*500) #scale up the paycheck values so they are visible alongside the larger-magnitude features
SAwest2 = SAwest1.drop(['MedianAgeMale','AverageFamilySize','AVG_Paycheck'],axis=1,errors='ignore',inplace=False)
plt.figure(figsize=(20,10))
sns.set(style="whitegrid", palette="muted",font_scale=2)
# "Melt" the dataset to "long-form" or "tidy" representation
melted_clustered = SAwest2.melt(id_vars='ZIPCode', var_name='Feature', value_name='Value')
# Draw a categorical point plot to show each ZIP code's profile across the features
sns.catplot(x="Feature", y="Value", hue="ZIPCode",
            data=melted_clustered, kind="point", height=10, aspect=2);
Only one of the 5 zip codes has below average values for Cluster 5. Because 78250 is in the middle of our cluster 5 zip codes, it looks like a great place to open a premium barbershop.
The mitigating factor for this location is that there is some competition in the adjacent area, though none in the immediate zip code itself.
78218 and 78250 and their surrounding areas are the best places to open a premium barbershop. Of course, this analysis does not include street-level knowledge, knowledge of local business rents, or the availability of commercially zoned space for rent.
This data-driven study set out to solve a problem for premium barbershop entrepreneurs: Where should I set up my next shop? Along the way, it also provided an answer to, “Where is my competition located?” The results narrow the search an entrepreneur would need to do from a city of 61 zip codes to 2 prospective zip codes, saving time and providing value. Of course, this study doesn’t account for all the variables that go into opening a business in a certain location, but it does make finding the right place easier.